Notebook

Assignment 02 - MySQL¶

DBMS for Analytics

Due: Wednesday, March 3rd, at midnight

Notes:

All code should follow the PEP 8 Style Guide for Python
Assignment should be submitted using jupyter notebooks
File name should follow “{Your Name} – Assignment_XX_Submission”
Each Task should be contained in its own cell
Each Task should be properly commented
Each Task should print out the answer to the Task if appropriate
Sample submission can be found on blackboard
Not following these standards may result in lost points

Assignment Description¶

For this assignment we are going to explore how the different structures for Mongo and SQL lead to different optimizations on how we can query data.

In [1]:

from pymongo import MongoClient

client = MongoClient(host='18.219.151.47', #host is the hostname for the database
                     port=27017, #port is the port number that mongo is running on
                     username='student', #username for the db
                     password='emse6992pass', #password for the db
                     authSource='emse6992') #Since our user only exists for the emse6992 db, we need to specify this

db = client.emse6992
stats_coll = db.twitter_statuses

In [2]:

import pymysql

conn = pymysql.connect(host='gwumysql-restore.cazdwdlcg6dm.us-east-2.rds.amazonaws.com',
                             user='EMSE6992',
                             port=3306,
                             db='EMSE6992',
                             password='MySQLpass')

Task 1¶

Identify the set list (no duplicates) of users who have retweeted a statuses made by 'elonmusk' using both MySQL and Mongo. For both MySQL and Mongo ensure that your cell prints out the number of unique users.

Expected Outputs:

MySQL: 312 users
Mongo: 312 users

Note: Mongo's distinct command can take a second parameter to filter prior to running distinct - db.<collection_name>.distinct(<distinct_field>, {<filter>})

In [3]:

# Space for Mongo Implementation
retw_coll = db.twitter_retweets
task_one_mongo = retw_coll.distinct(('user.id'), {'retweeted_status.user.screen_name':'elonmusk'})
print(f'Mongo: {len(task_one_mongo)} users')

Mongo: 312 users

In [4]:

# Space for MySQL implementation
cur = conn.cursor()

task_one_MySQL = cur.execute("""
SELECT distinct user_id FROM retweets 
    WHERE retweeted_status in (
    SELECT distinct status_id FROM statuses 
    WHERE user_id = (SELECT user_id FROM users WHERE screen_name = 'elonmusk')
    )
""")

print(f'MySQL: {task_one_MySQL} users')

MySQL: 312 users

Task 2¶

Identify the set list (no duplicates) of all users using Mongo and MySQL. For both MySQL and Mongo ensure that your cell prints out the number of unique users.

Expected Outputs:

MySQL: 86202 users
Mongo: 102812 users
- NOTE: your number may differ for Mongo depending on approach and thoroughness

In [5]:

# Space for Mongo Implementation
favor_coll = db.twitter_favorites
friend_coll = db.twitter_friends
lists_coll = db.twitter_lists

favor_num = favor_coll.distinct('user.id')
friend_num = friend_coll.distinct('id')
list_num = lists_coll.distinct('user.id')
retw_num = retw_coll.distinct('user.id')
stats_num = stats_coll.distinct('user.id')

final_list = favor_num+friend_num+list_num+retw_num+stats_num

def get_unique_numbers(numbers):

    list_of_unique_numbers = []
    unique_numbers = set(numbers)
    for number in unique_numbers:
        list_of_unique_numbers.append(number)

    return list_of_unique_numbers

In [6]:

task_two_Mongo = get_unique_numbers(final_list)
print(f'Mongo: {len(task_two_Mongo)} users')

Mongo: 104268 users

In [7]:

# Space for MySQL implementation

task_two_mysql = cur.execute("Select distinct user_id From users")
print(f'MySQL: {task_two_mysql} users')

MySQL: 86202 users

Task 3¶

For each of the previous tasks provide a couple of sentences identifying which database was easier to work with and detailing why that database's structure simplified the task.

Answer the prompt regarding Task 1:

Regarding to task one, I believe Mongo database is easier to work with. Since the prompt on this specific question is asking us to find the list of users who have retweeted a post that was originally from Elon Musk. After exploring the database on Mongo, we discover that there is a ‘twitter_retweets’ table that has all the retweets information. And from this table, there is a group called ‘retweeted_status’ that contains all the information about the original tweet where the user retweeted from. So, by finding the user’s screen name equal to ‘elonmusk’ from the ‘quoted_status’ under the group of ‘retweeted_status’, we will be able to find all the list of users who have retweeted Elon Musk. On the other hand, SQL also has the 'retweet' table but does not contain any information from the original post. We will then inner join with the status and users table to obtain the users information. And finally we need to provide a condition statement to check the final list of unique users. It appears that Mongo will be easier in completing this task.

Answer the prompt regarding Task 2:

Regarding to task two, I believe SQL database is easier to work with. Since the prompt is asking us to find all the distinct users from the database and by discovering the SQL database, we know that SQL database has a table called ‘users’ that contains all the users’ information. So, by conducting a single line query on SQL, we are able to find all the distinct/unique value in the table by the primary key ‘user_id’, which it gives the final result as a list of users in the database with no duplicates. However, the Mongo database does not have a table that has all the user’s information. It will require to add up all the users from all the five different databases and to find the unique value from it. Therefore, it appears the SQL will be easier to complete this task.

Task 4¶

How would you modify our SQL database to handle hashtags? Please provide information regarding what tables you would modify/create and how these modifications would enable the support for Hashtags.

Explanation:

First in order to allow to SQL database to handle hashtags, we need to modify the 'statuses' table in SQL. In the 'statuses' table, first we create a column or a field and name it as hashtag. In this column, we capture the hashtags’ value from each single post from the table. Since the hashtags’ value appear in the text column in the 'statuses' table, we store the string value starting behind the symbol “#” and ending before space since the format of hashtag is an unbroken word or phrase. If there is no hashtag from the post, we will have a null value in the hashtag field. If there are more than one hashtag value capture on a single, we can store them in the same cell of the hashtag field separated by commas.

The modification that we have created would be benefic for searching the hashtags and finding the relationship between hashtags and their posts. It is also an efficient way to search since we do not need to create any additional tables that would causing more space for the database to store things. Regarding to the hashtag searching, we can conduct queries include the ‘like’ condition to search the desire the hashtag value that we are looking for. After executing the query, we will have all the posts’ information that contain the hashtag. And further by inner joining the ‘users’ table, we can also find the users’ information regarding the original post that has the hashtag.

Task 5¶

If twitter decided to add the ability to customize tweets font/colors/etc., how would you implement this in SQL vs Mongo. Which do you think is easier to setup/easier to maintain.

Explanation:

If twitter adds the ability of customizing tweet’s font and color, I would modify the ‘statuses’ table in SQL in order to help supporting this function. In the ‘status’ table, we can create a column or a field and name it as ‘font_style’. In this field, it captures the font style of each single post from the ‘status’ table. And likewise, for the colors we can create another column or field with name ‘color’. And it would be interesting and more meaningful for the perspective of analytics, if twitter can set up rules like different colors represent different mood or status of the users when posting the tweets. So, the ‘color’ field stores the color of each tweets in the ‘statuses’ table.

For Mongo database, I would modify the ‘twitter_statuses’ table and‘twitter_retweets’ table since both tables contain the information of the posts. For the ‘twitter_statuses’ table, I would add two fields with the name of ‘font_style’ and ‘color’ underneath the field ‘text’. For ‘twitter_retweets’ table, I would add these two new fields in the group of ‘retweeted_status’ since the color and font style belong to the original post’s information. I would also add both the fields of 'font_style' and 'color' outside of the 'retweeted_status' under the field 'text'. This is because when someone is retweeting someone else's post, he/she can also comment on the post which will also need to have these two fields to record the value.

I would say the implement on SQL will be a little bit easier than implementing on Mongo database. Since there is only one table in SQL that contains the specific information about each tweets/post, it will be easier to set up and to maintain by only monitoring this table. For Mongo on the other hands, it will require to set up and maintain on more than one table which will cause more time consuming and more human efforts. Also, The way to construct the database between SQL and Mongo varies a lot. Mongo has total of five different tables which has a lot more complex structure than the SQL database does. So when encountering problems like adding new fields to the database, Mongo will be relatively difficult than the SQL database.